Sampling Variability – The Heart of Inference

library(tidyverse)
library(infer)
library(scales)

library(stringr)
library(openintro)

Salaries of football coaches

knitr::include_graphics(here::here("slides", 
                                   "images", 
                                   "football.png"))
coaches <- read_csv(here::here("slides", 
                               "data", 
                               "cu_csu_coaches.csv")) %>% 
  filter(`Base Pay` > 0)  # backticks, not quotes: 'Base Pay' > 0 compares a string to 0
csu <- coaches %>% 
  filter(Agency == "California State University")

uc <- coaches %>% 
  filter(Agency == "University of California")

Sampling Strategies

What types of samples could we collect? Are some methods “better” than others?

At your table…

First

  • each person samples 10 salaries
  • calculate the median

Then

  • calculate the median of all 25 salaries

Each table has a sample of 25 UC & CSU coach salaries.


Would you feel comfortable inferring that the median salary of your sample is close to the median salary of all UC & CSU coaches?


Why or why not?

Why sample more than once?

Variability is a central focus of the discipline of Statistics!

Making decisions based on limited information is uncomfortable!

You likely weren’t willing to infer the population median salary from your sample!

Sampling Framework

population – collection of observations / individuals we are interested in

population parameter – numerical summary about the population that is unknown but you wish you knew


sample – a collection of observations from the population

sample statistic – a summary statistic computed from a sample that estimates the unknown population parameter.
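These four terms map directly onto the coaches example. A sketch (assuming the `coaches` data frame loaded above, with `Total Pay & Benefits` as the variable of interest):

```r
# Population: all UC & CSU head coaches in `coaches`
# Population parameter: the median pay (normally unknown!)
population_median <- median(coaches$`Total Pay & Benefits`)

# Sample: 25 coaches drawn at random from the population
one_sample <- coaches %>% 
  rep_sample_n(size = 25, reps = 1)

# Sample statistic: the median of the sample --
# our estimate of the unknown population parameter
one_sample %>% 
  summarize(sample_median = median(`Total Pay & Benefits`))
```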

Statistical Inference

There were 252 “Head Coaches” across the University of California and California State University campuses in 2019 (that satisfied my search criteria)


Median salary for all 252 coaches

$137,619
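Because here we (unusually) have the entire population, the parameter can be computed directly. A sketch, assuming the reported median refers to the `Total Pay & Benefits` column used throughout these slides:

```r
# The population parameter -- ordinarily we would never get to see this
median(coaches$`Total Pay & Benefits`)
```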

Inferring information from your sample onto the population is called statistical inference.

Statistical Inference Reasoning

  • If the sampling is done at random, then
    • the sample is representative of the population,
    • any result based on the sample can generalize to the population, and
    • the point estimate is a “good guess” of the unknown population parameter.



Shouldn’t one random sample be enough then? Isn’t that what we use to make confidence intervals and do hypothesis tests?
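One way to see why a single sample is risky: draw several random samples and watch the median bounce around. A sketch (the particular values will vary from run to run):

```r
set.seed(123)  # for reproducibility; any seed works

coaches %>% 
  rep_sample_n(size = 25, reps = 5) %>% 
  group_by(replicate) %>% 
  summarize(median = median(`Total Pay & Benefits`))
# Five replicates give five different sample medians --
# each is a plausible "good guess", yet they disagree.
```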

Virtual Sampling

rep_sample_n(coaches, 
             size = 25, 
             reps = 1, 
             replace = TRUE)


Employee Name              Job Title               Total Pay & Benefits
David Bradley Kreutzkamp   Head Coach 5                        105683.0
Daniel Dykes               Head Coach 5                        540000.0
Daniel Conners             Head Coach 5                        156181.0
Lauren Beth Nadler         Head Coach 5                         44714.0
Richard A Gallien          HEAD COACH - 12 MONTH               103189.9
Gregory L Kamansky         HEAD COACH - 12 MONTH               198151.7
\(\vdots\)

Distribution of 1000 medians from samples of 25 coaches

virtual_samples25 <- coaches %>% 
  rep_sample_n(size = 25, reps = 1000)

virtual_med25 <- virtual_samples25 %>% 
  group_by(replicate) %>% 
  summarize(median = median(`Total Pay & Benefits`)) %>% 
  mutate(samps = "25")

virtual_samples50 <- coaches %>% 
  rep_sample_n(size = 50, reps = 1000)

virtual_med50 <- virtual_samples50 %>% 
  group_by(replicate) %>% 
  summarize(median = median(`Total Pay & Benefits`)) %>% 
  mutate(samps = "50")

virtual_samples100 <- coaches %>% 
  rep_sample_n(size = 100, reps = 1000)

virtual_med100 <- virtual_samples100 %>% 
  group_by(replicate) %>% 
  summarize(median = median(`Total Pay & Benefits`)) %>% 
  mutate(samps = "100")

master_samples <- bind_rows(virtual_med25, 
                            virtual_med50, 
                            virtual_med100) 
master_samples %>% 
 filter(samps == "50") %>% 
  ggplot(mapping = aes(x = median)) + 
  geom_histogram(binwidth = 6500, color = "white") + 
  scale_x_continuous(labels = scales::dollar_format()) +
  labs(x = "Median Salary")

Sampling Distributions

  • Visualize the effect of sampling variation on the distribution of any point estimate
    • In this case, the sample median
  • We can use sampling distributions to make statements about what values we can typically expect.

Be careful! A sampling distribution is different from a sample’s distribution!

Distributions of 1000 medians from different sample sizes

master_samples %>% 
  mutate(samps = factor(samps, 
                        levels = c("25", "50", "100")
                        )
         ) %>% 
  ggplot(mapping = aes(x = median)) + 
  geom_histogram(binwidth = 6500, color = "white") + 
  facet_wrap(~samps) + 
  scale_x_continuous(labels = scales::dollar_format()) +
  labs(x = "Median Salary") +
  theme(axis.text.x = element_text(hjust = 1, 
                                   vjust = 1, 
                                   angle = 45))

What differences do you see?

Variability for Different Sample Sizes

Sample Size   Standard Error of Median
         25                  19342.081
         50                  12459.358
        100                   8279.311

  • Standard errors quantify the variability of point estimates

  • As a general rule, as sample size increases, the standard error decreases.

Careful! There are important differences between standard errors and standard deviations.
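The standard errors in the table above can be computed from `master_samples`: the standard error of the median is the standard deviation of the 1000 simulated medians at each sample size. A sketch (exact values depend on the random draws):

```r
master_samples %>% 
  group_by(samps) %>% 
  summarize(se_median = sd(median))
# The standard DEVIATION of a sampling distribution
# is what we call the standard ERROR of the estimate.
```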

A good guess?

master_samples %>% 
  mutate(samps = factor(samps, 
                        levels = c("25", "50", "100")
                           )
         ) %>% 
  ggplot(mapping = aes(x = median)) + 
  geom_histogram(binwidth = 6500, color = "white") + 
  geom_vline(xintercept = median(coaches$`Total Pay & Benefits`), 
             color = "red", 
             linewidth = 1.5) +
  facet_wrap(~samps) + 
  scale_x_continuous(labels = scales::dollar_format()) +
  labs(x = "Median Salary")

Precision & Accuracy

  • Random sampling helps ensure our point estimates are accurate: on average, they land on the population parameter.


  • Larger sample sizes make our point estimates more precise: the standard error shrinks.

Sampling Activity!